41 research outputs found

    Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems

    Background: Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits. Results: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework. Conclusions: sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets.
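
The idea summarised above (a sparse PLS step applied to class labels) can be illustrated with a minimal NumPy sketch. This is not the mixOmics implementation: the function names `soft_threshold` and `splsda_component`, the `keep` parameter, and the choice to extract a single component are assumptions made for illustration. Class labels are dummy-coded, the leading singular vector of the cross-product matrix is computed, and soft-thresholding zeroes all but the `keep` largest loadings.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry of v towards zero by lam; entries with |v| <= lam become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def splsda_component(X, labels, keep):
    """One sparse PLS-DA-style component (illustrative sketch, hypothetical API).

    X      : (n_samples, n_variables) data matrix
    labels : length-n_samples array of class labels
    keep   : number of variables allowed a non-zero loading
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Dummy-code the classes: one indicator column per class.
    Y = (labels[:, None] == classes[None, :]).astype(float)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # The leading left singular vector of X'Y gives the dense PLS loading.
    U, _, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    u = U[:, 0]
    # Use the (keep+1)-th largest magnitude as the threshold, so that
    # (ties aside) exactly `keep` loadings stay non-zero.
    mags = np.sort(np.abs(u))[::-1]
    lam = mags[keep] if keep < u.size else 0.0
    u_sparse = soft_threshold(u, lam)
    u_sparse /= np.linalg.norm(u_sparse)
    scores = Xc @ u_sparse  # sample coordinates on the sparse component
    return u_sparse, scores
```

The non-zero entries of the returned loading vector identify the selected variables, and the scores can be plotted to inspect class separation. A full analysis would extract further components after deflation and tune `keep` by cross-validation, neither of which is shown here.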

    Integrative analysis of gene expression and copy number alterations using canonical correlation analysis

    Supplementary Figure 1. Representation of the samples from the tuning set by their coordinates in the first two pairs of features (extracted from the tuning set) using regularized dual CCA, with regularization parameters tx = 0.9, ty = 0.3 (left panel), and PCA+CCA (right panel). We show the representations with respect to both the copy number features and the gene expression features in a superimposed way, where each sample is represented by two markers. The filled markers represent the coordinates in the features extracted from the copy number variables, and the open markers represent coordinates in the features extracted from the gene expression variables. Samples with different leukemia subtypes are shown with different colors. The first feature pair distinguishes the HD50 group from the rest, while the second feature pair represents the characteristics of the samples from the E2A/PBX1 subtype. The high canonical correlation obtained for the tuning samples with regularized dual CCA is apparent in the left panel, where the two points for each sample coincide. Nevertheless, the extracted features have a high generalization ability, as can be seen in the left panel of Figure 5, showing the representation of the validation samples.
    Supplementary Figure 2. Representation of the samples from the tuning set by their coordinates in the first two pairs of features (extracted from the tuning set) using regularized dual CCA, with regularization parameters tx = 0, ty = 0 (left panel), and tx = 1, ty = 1 (right panel). We show the representations with respect to both the copy number features and the gene expression features in a superimposed way, where each sample is represented by tw

    Pharmacogenetics Meets Metabolomics: Discovery of Tryptophan as a New Endogenous OCT2 Substrate Related to Metformin Disposition

    Genetic polymorphisms of the organic cation transporter 2 (OCT2), encoded by SLC22A2, have been investigated in association with metformin disposition. A decrease in transport function has been shown to be associated with the OCT2 variants. Using metabolomics, our study aims at a comprehensive monitoring of primary metabolite changes in order to understand the biochemical alterations associated with OCT2 polymorphisms and to discover potential endogenous metabolites related to the genetic variation of OCT2. Using GC-TOF MS based metabolite profiling, clear clustering of samples was observed in Partial Least Squares Discriminant Analysis, showing that metabolic profiles were linked to the genetic variants of OCT2. Tryptophan and uridine presented the most significant alterations in the SLC22A2-808TT homozygous and the SLC22A2-808G>T heterozygous variants relative to the reference. In particular, tryptophan showed gene-dose effects of transporter activity according to OCT2 genotypes and the greatest linear association with the pharmacokinetic parameters (Clrenal, Clsec, Cl/F/kg, and Vd/F/kg) of metformin. An inhibition assay demonstrated the inhibitory effect of tryptophan on the uptake of 1-methyl-4-phenylpyridinium in a concentration-dependent manner, and a subsequent uptake experiment revealed differential tryptophan uptake rates in oocytes expressing the OCT2 reference and variant (808G>T). Our results collectively indicate that tryptophan can serve as one of the endogenous substrates of OCT2 as well as a biomarker candidate indicating the variability of the transport activity of OCT2.

    Visualising associations between paired 'omics' data sets

    Background: Each omics platform is now able to generate a large amount of data. Genomics, proteomics, metabolomics and interactomics data are compiled at an ever increasing pace and now form a core part of the fundamental systems biology framework. Recently, several integrative approaches have been proposed to extract meaningful information. However, these approaches lack visualisation outputs to fully unravel the complex associations between different biological entities.

    A novel approach for biomarker selection and the integration of repeated measures experiments from two assays

    Background: High throughput 'omics' experiments are usually designed to compare changes observed between different conditions (or interventions) and to identify biomarkers capable of characterizing each condition. We consider the complex structure of repeated measurements from different assays where different conditions are applied on the same subjects.

    Sparse canonical methods for biological data integration: application to a cross-platform study

    Background: In the context of systems biology, few sparse approaches have been proposed so far to integrate several data sets. It is however an important and fundamental issue that will be widely encountered in post genomic studies, when simultaneously analyzing transcriptomics, proteomics and metabolomics data using different platforms, so as to understand the mutual interactions between the different data sets. In this high dimensional setting, variable selection is crucial to give interpretable results. We focus on a sparse Partial Least Squares approach (sPLS) to handle two-block data sets, where the relationship between the two types of variables is known to be symmetric. Sparse PLS has been developed either for a regression or a canonical correlation framework and includes a built-in procedure to select variables while integrating data. To illustrate the canonical mode approach, we analyzed the NCI60 data sets, where two different platforms (cDNA and Affymetrix chips) were used to study the transcriptome of sixty cancer cell lines.
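
The built-in variable selection described above can be sketched for a two-block setting as follows. This is a rough illustration under stated assumptions, not the authors' algorithm: the function name `sparse_pls_pair` and the parameters `frac_x` and `frac_y` (thresholds expressed as a fraction of the largest absolute loading, each assumed to lie in [0, 1)) are hypothetical. The sketch alternates soft-thresholded updates of the two loading vectors, treating both blocks symmetrically, and returns only the first pair.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry of v towards zero by lam; entries with |v| <= lam become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pls_pair(X, Y, frac_x, frac_y, n_iter=100, tol=1e-8):
    """First pair of sparse loading vectors for two blocks (illustrative sketch).

    frac_x, frac_y are assumed to lie in [0, 1): each threshold is that
    fraction of the largest absolute loading, so at least one variable
    per block always survives.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    M = Xc.T @ Yc  # cross-product matrix between the two blocks
    if not M.any():
        raise ValueError("the two blocks have no cross-covariance")
    # Start from the dense solution: leading singular vectors of M.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    for _ in range(n_iter):
        a = M @ v
        u_new = soft_threshold(a, frac_x * np.abs(a).max())
        u_new /= np.linalg.norm(u_new)
        b = M.T @ u_new
        v_new = soft_threshold(b, frac_y * np.abs(b).max())
        v_new /= np.linalg.norm(v_new)
        converged = (np.linalg.norm(u_new - u) < tol
                     and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break
    return u, v
```

Variables with non-zero entries in `u` and `v` are the ones selected from each block. Further pairs would require deflating both blocks, and the thresholds would normally be tuned rather than fixed by hand.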